Abstract

Background: The delineation of genomic copy number abnormalities (CNAs) from cancer samples has 
been instrumental for identification of tumor suppressor genes and oncogenes and proven useful for clinical 
marker detection. An increasing number of projects have mapped CNAs using high-resolution microarray
based techniques. So far, no single resource provides a global collection of readily accessible
oncogenomic array data.

Methodology/Principal Findings: We here present arrayMap, a curated reference database and
bioinformatics resource targeting copy number profiling data in human cancer. The arrayMap database
provides a platform for meta-analysis and systems level data integration of high-resolution oncogenomic
CNA data. To date, the resource incorporates more than 40,000 arrays in 224 cancer types extracted from
several resources, including NCBI's Gene Expression Omnibus (GEO), EBI's ArrayExpress (AE), The
Cancer Genome Atlas (TCGA), publication supplements and direct submissions. For the majority of the
included datasets, probe level and integrated visualization facilitate gene level and genome wide data
review. Results from multi-case selections can be connected to downstream data analysis and visualization
tools.

Conclusions/Significance: To our knowledge, no data source currently provides an extensive collection
of high resolution oncogenomic CNA data that could readily be used for genomic feature mining across a
representative range of cancer entities. arrayMap represents our effort to provide a long term platform
for oncogenomic CNA data, independent of specific platform considerations or specific project dependence.
The online database can be accessed at http://www.arraymap.org.

Introduction 

Genomic copy number abnormalities (CNAs) are a relevant feature in the development of basically all 
forms of human malignancies [1]. Many genomic imbalances are recurrent and display tumor-specific 
patterns [2,3]. It is believed that these genomic instabilities reveal mutations in tumor suppressor genes 
and oncogenes which eventually result in a clone of fully malignant cells. Investigation of CNA hot 
spots (chromosomal loci frequently involved in CNA) has proven to be an effective methodology to 
identify novel cancer-causing genes [4,5]. On a systems level, CNA data along with expression or somatic 
mutation data is used to detect pathways altered in cancers and to deduce functional relevance of pathway 
members [6,7]. Since many CNAs have been attributed to specific tumor types or clinical risk profiles, 
in some entities copy number profiling is employed to characterize different biological as well as clinical 
subtypes with implications for treatment and individual prognosis. Subtype-associated CNA regions are 
used to predict causative genes, furthering understanding of biological differences and leading to discovery 
of new therapeutic targets [8,9].

Throughout the last two decades, molecular-cytogenetic techniques have been applied to scan genomic
copy number profiles in virtually all types of human neoplasias. For whole genome analysis, these
techniques predominantly consist of chromosomal and array comparative genomic hybridization (CGH),
including CNA detection by cDNA and single nucleotide polymorphism (SNP) arrays [10-12]¹. While
chromosomal CGH has a limited spatial resolution of several megabases, the resolution of recent array
based technologies (aCGH) is limited mainly by cost/benefit considerations rather than by technical obstacles.

The flood of new insights into structural genomic changes in health and disease has led to an increased 
interest in genomic data sets in genetic and cancer research. Several systematic studies of CNAs across 
many cancer types have been performed [13,14]. These efforts aim at a more complete understanding
of the functional effects of CNAs in the context of cancer.

The exponential increase of high resolution CNA datasets offers new challenges and opportunities for 
large-scale genomic data mining, data modeling and functional data integration. Several online resources 
have been developed, focusing on different aspects of data content as well as representation [6, 15-19]. An 
overview of some of the prominent examples is given in Table 1. In principle, these databases facilitate 
access and utilization of CNA data. However, they are limited to specific aCGH platforms and/or
single institutions as well as limited disease categories, or, as in the cases of GEO [15] and EBI
ArrayExpress [16], mainly serve as raw data repositories. To the best of our knowledge, no single data
source yet provides an extensive collection of high resolution oncogenomic CNA data that could
readily be used for genomic feature mining across a representative range of cancer entities.

Here we present "arrayMap", a web-based reference database for genomic copy number data sets in
cancer. We have generated a pipeline to accumulate and process oncogenomic array data into a unified 
and structured format. The resource incorporates associated histopathological and clinical information 
where accessible. 

So far, arrayMap contains more than 40,000 arrays on 224 cancer types from five main data sources: 
NCBI GEO, EBI ArrayExpress, The Cancer Genome Atlas, publication supplements and user submitted 
data. Samples of interest can be browsed, visualized and analyzed via an intuitive interface.
Computational tools are provided for biostatistical data analysis, such as CNA clustering for case
specific or subset data and basic clinical correlations. arrayMap is publicly available at www.arraymap.org.

Results 

Data content 

Our combination of "top-down" (publication driven) and "bottom-up" (array data driven)
approaches allowed us to identify a comprehensive set of accessible aCGH based cancer CNA data sets
and to estimate the proportion of accessible data among all published/deposited data.

As the main result of the array data driven approach, we extracted 495 series comprising 32002 arrays,
generated on 237 platforms, from NCBI's GEO. Among those, raw data files of approximately 29000 whole
genome arrays were suitable for inclusion into our data processing pipeline. When reviewing the content
of AE, we found that the majority of AE cancer genome data sets had also been submitted to GEO. At the
time of writing, 11 datasets including 712 arrays not present in GEO had been processed based on AE
specific series. Detailed information on the GEO/AE data sets is provided in supplementary Table S1.

The top-down procedure was based on our group's continuous monitoring of cancer related articles
utilizing genome copy number screening approaches, as established for our "Progenetix" project
(www.progenetix.org; [19]). The census date for the literature based data collection was August 15, 2011.
At this point, we had identified 931 articles discussing a total of 53213 genomic cancer CNA profiles based
on aCGH techniques. Of these, 8728 cases from 199 articles had so far been extracted from publication
related sources (e.g. supplementary data tables), annotated and made accessible through Progenetix.
This data included cases for which only supervised information but no probe data was available
(e.g. author annotated Golden Path or cytogenetic CNA regions). Literature based data sets containing
probe specific data or with the respective data presented to us by the authors (640 samples) were included
in our arrayMap data processing pipeline.

¹ In this article, we use the terms "array CGH" and "aCGH" for all technical variants of whole genome copy number arrays.
This includes e.g. single color arrays for which regional copy number normalization is performed through bioinformatics
procedures applied to external references and internal data distribution.

Table 1. Prominent online resources of genomic data

Name               | Address                         | Platform(s) | Data format                               | Comment
-------------------|---------------------------------|-------------|-------------------------------------------|----------------------------------------
GEO [15]           | www.ncbi.nlm.nih.gov/geo        | 263         | raw and normalized probe signal intensity | largest microarray data repository
ArrayExpress* [16] | www.ebi.ac.uk/arrayexpress      | 16          | raw and normalized probe signal intensity | many duplicate data in GEO
TCGA [6]           | cancergenome.nih.gov            | 1           | segmentation data                         | download of raw probe data is restricted
CanGEM** [17]      | www.cangem.org                  | 38          | normalized probe signal intensity         | including many types of microarray data
CaSNP [18]         | cistrome.dfci.harvard.edu/CaSNP | 8           | average copy number and graphic           | focus on SNP array data
Progenetix [19]    | www.progenetix.org              | 235         | ISCN*** and golden path                   | data from publications

Data up to 29 April, 2011
* excluding data both in GEO and ArrayExpress
** statistical information only, including CGH, SNP and cDNA data
*** International System for Human Cytogenetic Nomenclature

The data content of arrayMap is summarized in Table 2. Current numbers on the website will include 
changes based on ongoing annotation efforts (i.e. addition of data sets, removal of low quality arrays). 

As a by-product of our data collection and annotation efforts, we are able to provide estimates of
content and trends for the platform usage and cancer entity coverage for the majority of published data.
According to the assigned ICD-O-3 (International Classification of Diseases for Oncology, 3rd Edition)
code and descriptive diagnostic text, breast carcinoma predominates as the single largest clinical entity
with 6459 arrays. Supplemental Table S2 presents sample sets in arrayMap classified by ICD-O code.



Table 2. aCGH data integrated in arrayMap

Data Source             | Arrays  | Cases | Series | Platforms | Publications
------------------------|---------|-------|--------|-----------|-------------
GEO                     | 32002   | 25728 | 495    | 237       | 490
ArrayExpress            | 712     |       | 11     | 16        | 11
TCGA                    | 7249    | 3594  | 19     | 1         | *
Publication Supplements | >4578** | 4578  |        |           | 137
Author Submission       | 556     | 539   | 8      | 7         |

Data up to 29 April, 2011
* Due to lack of publication information, there may be a small amount of duplicate data in GEO
** Array number may be higher than case number since reported results per case occasionally may be based on more than one
array. The number does not include data presented both in publication supplements as well as GEO.

[Figure 1 plot. Legend: BAC/P1; in situ oligonucleotide; oligonucleotide beads; spotted DNA/cDNA; spotted
oligonucleotide. X axis: GEO arrays, 2001-2011 (sorted by accession no.)]

Figure 1. Distribution of resolutions and techniques of GEO platforms. Each point represents a
genomic array. The Y axis is labeled with probe number in log scale. The X axis denotes the time
sequence of array data generation. From left to right are years from 2001 to 2011.

The most widely available array CGH platforms are either based on large insert clones (BAC/P1
arrays) or based on shorter single-stranded DNA molecules (oligonucleotide arrays), which may or may not
include single-nucleotide polymorphism specific probe sequences (SNP arrays). Also, although designed
for gene expression profiling, cDNA arrays were used by several laboratories for measuring genomic copy 
number changes. Although all these platforms are considered suitable for whole genome CNA analysis, 
their probe densities and other parameters can affect specific features of the analysis results [20-23]. 
Table S3 lists the general platform types and corresponding overall numbers of the data registered in 
arrayMap. 

In reviewing the technical platform composition, two related trends become apparent (Figure 1). 
Originally developed in groups with expertise in molecular cytogenetics and cancer genome analysis, 
printed large insert clone arrays (BAC/P1) were the first whole genome CNA screening tools with a spatial 
resolution surpassing that of chromosomal CGH. Other groups re-employed cDNA arrays, developed for 
expression screening, for genomic hybridizations. However, over the last years one can observe the
overwhelming use of various industrially produced oligonucleotide array platforms, which compensate
for their low single probe fidelity through a probe density 1-3 orders of magnitude higher than is common
for BAC/P1 arrays. Another reason for the success of oligonucleotide arrays is the integration of SNP
specific probes, which in principle allows the same experiments to be used for genetic association studies
and for the evaluation of copy number neutral loss of heterozygosity regions [12, 24, 25].

Data access and usage scenarios 

Based on our experience from the Progenetix project, a strong emphasis was put on a user friendly data 
interface. Here, we followed a "dual user type" scenario: Users without bioinformatics background should
be able to intuitively visualize core data features as well as to perform standard analysis procedures, while
for bioinformaticians the formatted database content should be accessible for use with their analysis tools
of choice.

Query interface. Data browsing in arrayMap is based on two types of query methods: search by 
experimental series metadata and search by array features. 

In the series query form, users can search by specifying (i) PubMed ID; (ii) series ID; (iii) platform ID;
(iv) platform description and (v) descriptive diagnosis text. For array specific
queries, additional features are available: array ID; disease classification (ICD-O-3 code and text) as well
as disease locus (code and text) and single or combined regional CNA.

In the results table, associated array information is displayed. A number of links to additional and/or 
outside data is provided, according to the information available: the corresponding PubMed entries; the 
original GEO/AE accession display page for more complete information; the case and publication entries
on the Progenetix website for further analysis; and importantly the array specific data visualization page.

Array probe data visualization. In the array plot interface, original plots of genomic array data 
sets can be searched and visualized (Supplemental Figure S1). Default threshold parameters which
were either provided with the data or assigned during the initial visualization will be loaded. In single 
array visualization, the general view of probe distribution and post-thresholding segmentation results 
are displayed for the whole genome as well as for each individual chromosome. If multiple arrays are 
retrieved, users can select sample data for downstream analysis procedures. Supplemental Figure S2 
shows the screenshot of single array visualization. 

Users can segment the raw data values and re-plot the results after revising the following parameters: 

• Golden path edition, default HG18/NCBI Build 36. This is still the commonly used version of the
human reference genome assembly. At the moment, coordinates of probes from all platforms are
remapped to HG18. For the near future, we intend to allow for a selection of updated genome
editions.

• Chromosomes to plot, default 1 to 22. Single or all chromosomes can be selected for re-plotting.
To avoid gender bias, most platforms were designed without probes on chromosomes X and Y.

• Loss/gain thresholds. Cut-offs from which a segment is considered a genomic loss or gain. The 
optimum thresholds may vary between platforms. 

• Region size in kb. Sets a filter to remove CNA below (e.g. probable noise) or above (e.g. exclude 
non-focal CNA) a certain size range. 

• Minimal probe numbers for segments. This parameter can be used to limit the minimal number of 
probes required for a segment to be considered (e.g. to remove aberrant segmentation due to probe 
level noise). Empirical examples would be values of 2-3 for high quality BAC arrays and 6-10 for 
Affymetrix SNP 6 arrays. 

• Plot region. Single genomic region to be plotted, overriding the chromosome selection above. 
When selected, plots with this region will be generated for all current arrays. This is valuable 
to e.g. display the CNA status and copy number transition points for specific genes of interest 
(Supplemental Figure S3). 
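
The interplay of the thresholding and filtering parameters listed above can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not arrayMap's actual implementation: the
`Segment` layout, the function name `call_segments` and the default thresholds of +/-0.15 are hypothetical
choices made for this example only.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Segment:
    """One segment from a segmentation run (hypothetical layout)."""
    chrom: str
    start: int        # base position, e.g. in HG18 coordinates
    end: int
    mean_log2: float  # mean log2 ratio of the probes in the segment
    n_probes: int


def call_segments(
    segments: List[Segment],
    gain_thr: float = 0.15,          # assumed value, platform dependent
    loss_thr: float = -0.15,         # assumed value, platform dependent
    min_kb: float = 0.0,
    max_kb: Optional[float] = None,
    min_probes: int = 3,
) -> List[Tuple[Segment, str]]:
    """Label segments as 'gain' or 'loss' after size and probe filters."""
    calls = []
    for seg in segments:
        size_kb = (seg.end - seg.start) / 1000.0
        if seg.n_probes < min_probes:
            continue  # too few probes: likely probe level noise
        if size_kb < min_kb:
            continue  # below the selected region size range
        if max_kb is not None and size_kb > max_kb:
            continue  # above the selected region size range (non-focal)
        if seg.mean_log2 >= gain_thr:
            calls.append((seg, "gain"))
        elif seg.mean_log2 <= loss_thr:
            calls.append((seg, "loss"))
    return calls


# Invented toy segments, not taken from any real array.
example = [
    Segment("9", 21_600_000, 22_400_000, -1.20, 40),  # focal loss
    Segment("7", 1, 158_000_000, 0.40, 9000),         # large gain
    Segment("1", 1_000_000, 1_020_000, 0.90, 2),      # only 2 probes
    Segment("2", 5_000_000, 9_000_000, 0.02, 300),    # balanced
]
all_calls = call_segments(example)
focal_calls = call_segments(example, max_kb=5000)
```

With the default settings the toy data yields one loss and one gain; restricting the region size to at
most 5000 kb keeps only the focal 9p21 loss.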

Zoom-in visualization of focal CNA. Figure 2 shows the visualization of focal genomic imbalances,
e.g. to identify genes of interest targeted by focal CNA. The whole genome view of GSM535547 (human
high grade glioma sample analyzed by Agilent Human Genome CGH Microarray 244A) shows a small
regional deletion in chromosome 9p21. When plotting the approximate locus of the deletion (specified as
chr9:21600000-22400000), genes, probes and chromosome bands in this zoomed-in region are shown. Two
genes, MTAP and CDKN2A, can be seen to be located in a potentially homozygously deleted region.
The focal deletion of these known tumor suppressor genes [26,27] points to their specific involvement in
the glioblastoma sample analyzed here.

Figure 2. Zoom-in visualization of focal CNA. (A) GSM535547 (human high grade glioma, Agilent
CGH 244A) shows high quality probe hybridization signals; CNAs are easy to distinguish. (B) When
zooming in on the whole chromosome 9, an approximately 80 Mb deletion is displayed, with the two breakpoints
located in the p and q arm, respectively. In addition, a small regional deletion in 9p21 is clearly visible. Color
bars in the lower region of the panel represent 848 genes located on chromosome 9. (C) Zoom into the
potentially homozygously deleted region in 9p21 by specifying the exact region: chr9:21600000-22400000.
The zoomed-in plot shows probes, chromosome band and two tumor suppressor genes, MTAP and
CDKN2A. Gene names and locations are displayed on mouse hover and link to the UCSC genome
browser for additional information.

Querying compound CNA. The concept of focal CNA detection can be integrated with a global
search for arrays containing gene specific regional imbalances. As an example, we demonstrate the search
for arrays displaying imbalances in 4 gene loci associated with glioblastoma: EGFR, a transmembrane
receptor and proto-oncogene [28]; PTEN, a tumor suppressor gene [29]; ASPM, frequently overexpressed
in glioblastoma relative to normal brain tissue [30]; and CDKN2A (see above). In the "Search Public
Arrays" form, the "Match ..." option can be used to specify the genomic regions of those four genes including
the expected CNA type: EGFR (chr7:55054219-55242524:1), PTEN (chr10:89613175-89718511:-1),
ASPM (chr1:195319885-195382287:1) and CDKN2A (chr9:21957751-21984490:-1), respectively.
When executing the query, these regions were matched against the whole database, returning the cases
with imbalances overlapping all of these regions. When excluding controls and "worst quality" datasets, 303
out of 42421 arrays could be identified matching all four CNA regions. In addition to glioblastoma, several
other types of cancer cases were among the results, including e.g. neuroblastomas, breast carcinomas,
melanomas and lung carcinomas, which is in accordance with some previous observations [31-34]. CNA
and associated data of those cases can be processed by online tools for further analysis and visualization
(Supplemental Figure S4) or downloaded for offline processing.
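
For offline processing, the matching logic of such a compound query can be approximated as follows. The
sketch assumes the region notation used above (chromosome:start-end:type, with 1 for gain and -1 for loss);
the `Region` type, the helper names and the two toy arrays are hypothetical and do not reflect the
database's internal query engine.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    cna_type: int  # 1 = gain, -1 = loss


def parse_region(spec: str) -> Region:
    """Parse e.g. 'chr7:55054219-55242524:1' into a Region."""
    chrom, span, cna_type = spec.split(":")
    start, end = (int(x) for x in span.split("-"))
    if chrom.startswith("chr"):
        chrom = chrom[3:]
    return Region(chrom, start, end, int(cna_type))


def overlaps(call: Region, query: Region) -> bool:
    """True if a called CNA has the queried type and overlaps the region."""
    return (call.chrom == query.chrom
            and call.cna_type == query.cna_type
            and call.start <= query.end
            and call.end >= query.start)


def matches_all(calls: List[Region], queries: List[Region]) -> bool:
    """An array matches only if every queried region is hit."""
    return all(any(overlaps(c, q) for c in calls) for q in queries)


queries = [parse_region(s) for s in (
    "chr7:55054219-55242524:1",    # EGFR gain
    "chr10:89613175-89718511:-1",  # PTEN loss
    "chr1:195319885-195382287:1",  # ASPM gain
    "chr9:21957751-21984490:-1",   # CDKN2A loss
)]

# Two invented arrays with called CNA segments (toy data).
array_a = [
    Region("7", 1, 158_000_000, 1),
    Region("10", 1, 135_000_000, -1),
    Region("1", 190_000_000, 200_000_000, 1),
    Region("9", 21_600_000, 22_400_000, -1),
]
array_b = array_a[:2]  # lacks the ASPM gain and the CDKN2A loss

hits = [name for name, calls in (("A", array_a), ("B", array_b))
        if matches_all(calls, queries)]
```

Only the toy array A carries all four imbalances and is therefore returned.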

Copy number profiling of selected cancer entities. One aim of arrayMap is to allow researchers 
to conveniently perform aCGH meta-analysis across different platforms. By selecting a single or several 
cancer entities e.g. based on their ICD entity codes or diagnostic keywords, users are able to generate 
disease specific CNA frequency profiles or to compare profiles of different cancer types. 

As an example, we used ICD-O code 9440/3 (glioblastoma, NOS) to query the database. 1478 arrays
from 25 publications were returned and passed to our suite of online analysis tools. Chromosomal
ideograms and histograms were generated representing the frequency of copy number aberrations
identified over the whole dataset (Figure 3A). In the overall aberration profile, the most common genomic
imbalances included whole chromosome 7 gain and chromosome 10 loss, as well as focal gains e.g. on
bands 1q21 and 17q21. In our example dataset, a prominent focal deletion hot-spot was centered around
9p21.3 (921 of 1478 arrays, 62.31%), which had been discussed previously [35]. The distribution of CNAs
over the individual arrays was visualized through a matrix plot (Figure 3B). As additional information
to the frequency histograms, this form of visualization facilitates e.g. the detection of CNA patterns
among individual arrays as well as the concordance of individual CNAs (e.g. here the arm-level changes
in chromosomes 7 and 10).
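
The frequency values behind such histograms follow from simple counting over the called CNAs of all arrays
in a selection. The following sketch shows the arithmetic on invented toy data; the tuple layout and the
function names are assumptions for illustration and not the code used by the online tools.

```python
from typing import List, Tuple

# A called CNA: (chromosome, start, end, type), type 1 = gain, -1 = loss.
Call = Tuple[str, int, int, int]


def region_frequency(arrays: List[List[Call]], chrom: str,
                     start: int, end: int, cna_type: int) -> float:
    """Percentage of arrays with a matching CNA overlapping the region."""
    if not arrays:
        return 0.0
    hits = sum(
        any(c == chrom and t == cna_type and s <= end and e >= start
            for (c, s, e, t) in calls)
        for calls in arrays
    )
    return 100.0 * hits / len(arrays)


def frequency_profile(arrays: List[List[Call]], chrom: str,
                      chrom_length: int, bin_size: int,
                      cna_type: int) -> List[float]:
    """Frequencies for consecutive bins along one chromosome."""
    return [
        region_frequency(arrays, chrom, pos,
                         min(pos + bin_size - 1, chrom_length), cna_type)
        for pos in range(1, chrom_length + 1, bin_size)
    ]


# Toy data: four arrays, three of them with a loss covering CDKN2A.
toy = [
    [("9", 21_000_000, 23_000_000, -1)],
    [("9", 1, 40_000_000, -1), ("7", 1, 158_000_000, 1)],
    [("9", 21_900_000, 22_000_000, -1)],
    [("7", 1, 158_000_000, 1)],
]
cdkn2a_loss = region_frequency(toy, "9", 21_957_751, 21_984_490, -1)
chr9_losses = frequency_profile(toy, "9", 40_000_000, 10_000_000, -1)

# The hot-spot frequency reported in the text uses the same arithmetic.
reported = round(100.0 * 921 / 1478, 2)
```

For the toy selection, the CDKN2A region is lost in 75% of arrays, and the same division reproduces the
62.31% quoted for the glioblastoma dataset.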

In the matrix plot, clicking on a segment opens the related view in the UCSC genome
browser [36] for detailed information related to this genomic region (SVG plot only). The plot order of
arrays can be re-sorted according to ICD morphology, ICD topography, clinical group or PubMed ID,
which can be helpful in associating CNA patterns to external classification categories. For the selected
classification criterion (default: ICD morphology), regional CNA frequencies for cases matching the
different values will be visualized through a heatmap (Figure 3C); this feature is especially useful when
comparing a number of different primary classification criteria.

An overall genomic copy number profile of cancer 

Our high quality core dataset in arrayMap was used to generate an overall cancer copy number aberration 
profile based on 29,137 arrays (Figure 4). 

This data represented 177 cancer types according to ICD-O-3 code, 59 of which
contained more than 50 arrays. Overall, one of the most common genomic alterations is copy-number
gain of chromosome band 8q24, which is found in 30% of total samples. According to the COSMIC [37]
database, the most significant cancer gene in this region is MYC. MYC is a well-documented oncogene that
codes for a transcription factor which is believed to regulate the expression of 15% of all genes, including genes
involved in cell division, growth, and apoptosis [38,39]. Other common imbalances observed in at least
25% of oncogenomic arrays included gains of regions on e.g. 17q21 (29%), 1q21 (33%) and loss of regions
on e.g. 8p23 (32%) and 9p21 (25%), including focal deletions of the CDKN2A/B locus (Figure 2).

While the overall CNA frequency distribution points towards DNA features targeted in multiple 
entities, this information is insufficient for deriving molecular mechanisms associated with specific cancer 
types. The genomic heterogeneity of different neoplasias is reflected in the varying patterns of regional 
CNA frequencies. Based on our core dataset, we have generated a heatmap-style visualization of frequency 
profiles for all ICD-O entities containing more than 50 arrays (Supplemental Figure S5). The striking
patterning of the CNA profiles indicates the non-random occurrence of CNAs, and should be seen as
an invitation to explore e.g. CNA similarities shared by separate histopathological entities, as a way to
transpose knowledge about pathophysiological mechanisms.

Figure 3. Copy number profiling of glioblastoma. (A) Chromosomal ideogram and histogram showing
frequency of copy number aberrations. Percentage values correspond to gains (yellow) and losses
(blue) identified over the whole dataset. The most frequent imbalances include gain of chromosome 7
and loss of chromosome 10 and 9p21.3. (B) Matrix plot of 1478 glioblastoma cases. The Y axis represents
individual samples. The distribution of genomic copy number imbalances reveals the individual
aberration patterns of glioblastoma. (C) Heatmap of regional CNA frequencies for 1478 arrays. The
intensity of green and red color components correlates to the relative gain and loss frequencies,
respectively. If the dataset contains cancer subtypes, cancers with similar CNA frequency profiles will be
clustered together, such that differences between subtypes will be revealed (e.g. see Figure S4H).

Figure 4. The overall cancer copy number aberration profile based on 29,137 arrays. This plot
represents 177 cancer types according to ICD-O-3 code. Percentage values on the Y axis correspond to
the numbers of gains (green) and losses (red) relative to the whole dataset.

Discussion 

arrayMap was developed to facilitate the progress of oncogenomic research. Our aim is to provide high- 
quality genomic copy number profiles of human tumors, along with a set of tools for accessing and 
analyzing CNA data. The service has been implemented with a straightforward web interface, including 
search options for CNA features and clinical annotation data. All assembled datasets are processed 
into platform independent segmentation and, for the vast majority of arrays, probe level data files, and 
are presented in consistent formats. Importantly, the direct access to precomputed probe level data 
plots supports a rapid evaluation of experiments for features of interest. As a curated database using 
standardized annotation schemes (e.g. ICD classification), arrayMap facilitates the exploration of cancer 
type specific CNA data, as well as the statistical association of genomic features to clinical parameters. 

arrayMap is a dynamic database that is being continuously expanded and improved. We will review 
the existing and newly published articles to update the database periodically. Over the past decade, we
have witnessed a rapidly increasing number of aCGH publications, which gives us sufficient evidence to
anticipate that cases will continue to be added to our database at a high rate. Although arrayMap is
not a user driven repository, we welcome and support users interested in using the site for yet undisclosed
data, if they agree on data sharing upon publication.

Although, in contrast to the continuous data from expression analysis, copy number analysis explores 
discrete value spaces (countable number of DNA copies, for segments defined by genomic base positions), 
interpretation of the data can vary due to different low level (e.g. signal/background correction) and 
higher level (e.g. segmentation algorithms, regional or size based filtering) procedures. In that respect, 
we have to emphasize that the results of our data processing and annotation procedures are open to 
scrutiny. We encourage a critical review of individual results, and are open for suggestions regarding 
improved processing procedures for specific platforms. 

In this paper, we have provided example scenarios of using arrayMap on different levels, i.e. locus 
centric and for entity profiling. We believe that systematic analyses will help researchers to discover
features which are indiscernible in individual studies, and thus bring tremendous new insights for the
understanding of disease pathology and inspire the development of new therapeutic approaches. Researchers
can integrate arrayMap data with their own analysis efforts, e.g. to increase sample size or for result 
verification purposes. 

We hope that this database will promote further evolution of microarray data meta-analysis.
arrayMap provides access to more than 200 tumor types, which makes it suitable for research across cancer
entities. Furthermore, normal sample controls are of vital importance for studies of genomic imbalances.
arrayMap includes more than 3000 normal samples from healthy individuals or from normal tissues of
cancer patients. These data could be integrated as a reference dataset, e.g. to account for copy number
variation data superimposed on the tumor profiling results.

In the near future, with the continuous accumulation of very high resolution CNA data from genomic 
arrays and whole genome sequencing experiments, it will become possible to integrate these data into 
systems biology methods to elucidate effects of genomic instability, and describe the results from more
perspectives. Envisioned examples include the identification of genes that are involved in metastasis
and treatment response; identification of the distribution of chromosomal breakpoints in cancer; and modeling
of functional networks in cancer by systems biology approaches.

Materials and Methods 
Dataset collection 

Raw experimental data from a variety of platforms and repositories were extracted and
converted to a uniform format suited to our reanalysis and visualization system. After a series
of parsing procedures, the called copy number data is stored in arrayMap. The flowchart of arrayMap
data collection and analysis is shown in Figure 5. Five main data sources are integrated into arrayMap:

GEO/AE. For extracting appropriate data series from GEO/AE, two basic criteria have to be
fulfilled. First, the raw data has to be from human malignancies analyzed by BAC, cDNA, aCGH or
oligonucleotide arrays. Second, the array platform must be genome wide, with the optional omission of
the sex chromosomes. Chromosome or region specific arrays were excluded because they are not able
to reveal the whole genomic profile of the respective cancer. Associated clinical data was extracted if
available.

TCGA. Segmentation data with available clinical information was extracted and incorporated into 
the database. Due to data sharing restrictions, TCGA data is an exception in that so far no probe
level data is incorporated into arrayMap. This exception was accepted since users will be able to access
individual TCGA datasets through the project's web portal at http://tcga-data.nci.nih.gov/tcga/.

Publications. Many aCGH datasets can be found in the text or supplementary files of publications. 
In order to collect data from publications, we relied on our Progenetix project's setup. Data in Progenetix
is manually curated. The collection strategies are: 

• literature mining using complex search parameters through PubMed 

• identification of called aCGH data, in GP annotation or tabular format (article, supplementary 
tables) 

• evaluation of supplementary files for probe specific data tables 

• follow-up on article link-outs to repository entries or referenced datasets

User submission. User submitted data was provided in a number of formats which were converted 
to the standard format as described. Although we accept and support private datasets, we insist on 
integration of at least the genomic and core clinical data (e.g. disease classifiers) upon publication of the
dataset's analysis results.

Dataset analysis 

Figure 5. The flowchart of arrayMap data collection and analysis procedures. Publicly available raw 
data or segmented data was collected from the respective data sources. Files were reprocessed by distinct 
procedures according to data type. All probe signals were converted to log2 values. 
Probe coordinates were remapped to the most commonly used human reference genome assembly 
(NCBI Build 36/hg18). Finally, all information was converted to a uniform format and stored in 
arrayMap, which is accessed through the web application. 

Probe remapping. A pipeline was developed to determine the genomic positions of the tens to 
hundreds of thousands of array probes with reference to a common genome Golden Path edition. For each 
array platform, the genome positions of probes were remapped to the currently most commonly used version 
of the human reference genome assembly (NCBI Build 36.1/hg18). Specific mapping procedures were 
employed for different types of probes. BAC clones were first remapped according to the clone set 
information of the Sanger/DECIPHER database [40]. If the probe position was not available there, the UCSC 
Genome annotation database [36] (release hg18) was used as a fallback. After these two steps, a 
mean of 98% of the BAC clones were remapped. For IMAGE clone sets, only the UCSC Genome annotation 
database was used. The average remapping rate of IMAGE clones was 91%. Affymetrix raw CEL 
data files were analyzed based on hg18 library files, so that the output segments have hg18 coordinates. 
The summary of the percentage of mapped probes is given in Table 3. The mapping details for each 
platform can be found in the supplementary data (Table S4). 
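
The two-step lookup described for BAC clones can be illustrated with a minimal Python sketch. It is not the pipeline used for arrayMap: the function names, the dictionary-based lookup tables and the clone identifiers are hypothetical, and in practice the tables would be parsed from the respective annotation files.

```python
from typing import Dict, Iterable, Optional, Tuple

# (chromosome, start, end) in hg18 coordinates
Position = Tuple[str, int, int]


def remap_probe(clone_id: str,
                primary: Dict[str, Position],
                fallback: Dict[str, Position]) -> Optional[Position]:
    """Look up a clone in the primary table, then in the fallback table."""
    position = primary.get(clone_id)
    if position is None:
        position = fallback.get(clone_id)
    return position


def remap_platform(clone_ids: Iterable[str],
                   primary: Dict[str, Position],
                   fallback: Dict[str, Position]):
    """Return the remapped positions and the fraction of clones remapped."""
    clone_ids = list(clone_ids)
    mapped = {}
    for clone_id in clone_ids:
        position = remap_probe(clone_id, primary, fallback)
        if position is not None:
            mapped[clone_id] = position
    rate = len(mapped) / len(clone_ids) if clone_ids else 0.0
    return mapped, rate


# Hypothetical example tables (assumed stand-ins for the clone set
# information and the genome annotation database).
primary_table = {"clone_A": ("chr1", 100, 200)}
fallback_table = {"clone_A": ("chr1", 1, 2), "clone_B": ("chr2", 300, 400)}
mapped, rate = remap_platform(["clone_A", "clone_B", "clone_C"],
                              primary_table, fallback_table)
```

In this toy input, clone_A is resolved from the primary table, clone_B only from the fallback, and clone_C remains unmapped, so the mapping rate is two out of three.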

Table 3. Percentage of remapped probes according to platform types 

Platform type              Average mapping rate   Number of arrays   Number of GPLs 
Original HG18 (Build 36)   NA                     1583               40 
in situ oligonucleotide    99%                    21678              55 
BAC/P1                     98%                    5464               55 
spotted DNA/cDNA           91%                    2365               82 

Probe signal normalization. The available array data was provided in a variety of formats, most 
frequently as the log2 ratio of probe hybridization intensities. To make data from different platforms 
directly comparable, all other types of normalized values were converted to log2. For dye swap 
experiments, reference/tumor intensity ratios were "reversed" to represent a tumor/reference value. For 
some two-color arrays for which only raw signal intensities were provided, the normalized log2 ratio for 
each probe was calculated by: 

$$r = \log_2\left(\frac{T_s - T_b}{R_s - R_b}\right)$$ 

where $T_s$ and $T_b$ represent tumor sample intensity and tumor channel background intensity, 
respectively, and $R_s$ and $R_b$ represent reference sample intensity and reference channel background 
intensity, respectively. 

If multiple instances of the same clone existed, the average signal intensity of that clone was 
used. 
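
The normalization steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for arrayMap: the function names and the spot tuple layout are hypothetical, spots whose background-corrected intensity is not positive are simply skipped, and replicate spots are averaged on the background-corrected intensities before the ratio is taken (the text does not specify at which stage averaging was done).

```python
import math
from collections import defaultdict


def log2_ratio(t_s, t_b, r_s, r_b):
    """Background-corrected log2 tumor/reference ratio for one spot.

    Returns None if either corrected intensity is not positive
    (an assumption; the source does not describe this case).
    """
    tumor = t_s - t_b
    reference = r_s - r_b
    if tumor <= 0 or reference <= 0:
        return None
    return math.log2(tumor / reference)


def reverse_dye_swap(log2_reference_over_tumor):
    """Convert a log2(reference/tumor) value to log2(tumor/reference)."""
    return -log2_reference_over_tumor


def clone_log2_ratios(spots):
    """Average replicate spots per clone, then compute the log2 ratio.

    spots: iterable of (clone_id, t_s, t_b, r_s, r_b) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for clone_id, t_s, t_b, r_s, r_b in spots:
        acc = sums[clone_id]
        acc[0] += t_s - t_b   # background-corrected tumor intensity
        acc[1] += r_s - r_b   # background-corrected reference intensity
        acc[2] += 1
    ratios = {}
    for clone_id, (tumor_sum, ref_sum, n) in sums.items():
        tumor_mean, ref_mean = tumor_sum / n, ref_sum / n
        if tumor_mean > 0 and ref_mean > 0:
            ratios[clone_id] = math.log2(tumor_mean / ref_mean)
    return ratios


# Hypothetical raw intensities: clone "A" is spotted twice.
example_spots = [
    ("A", 500, 100, 300, 100),
    ("A", 1300, 100, 700, 100),
    ("B", 150, 50, 250, 50),
    ("C", 50, 100, 200, 100),   # tumor signal below background, skipped
]
ratios = clone_log2_ratios(example_spots)
```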

Affymetrix genotyping arrays. For the widely used Affymetrix Genome-Wide SNP arrays, raw 
CEL files were downloaded and re-analyzed using the R package aroma.affymetrix [41] 
with the CRMA v2 method [42]. During the processing step, approximately 50 normal sample arrays 
were employed as a reference set for each array type to reduce the noise level. Normal tissue arrays from 
different labs were extracted and used to build the reference dataset. To obtain a high-quality 
reference set, we excluded normal arrays containing segments larger than 3 megabases, since copy number 
variations are generally smaller than 3 megabases. The list of normal tissue reference arrays is given in 
Table S5 in the supplementary data. 
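
The size-based exclusion of candidate reference arrays can be sketched as below. This is a minimal illustration, assuming that "segments" refers to called copy number segments (regions deviating from the baseline) and that each call is available as a (chromosome, start, end) tuple; the function name and the input layout are hypothetical.

```python
MAX_SEGMENT_BP = 3_000_000  # 3 megabases


def select_reference_arrays(calls_by_array, max_len=MAX_SEGMENT_BP):
    """Keep arrays in which no called segment is longer than max_len.

    calls_by_array: dict mapping an array identifier to a list of
    (chromosome, start, end) tuples for called segments.
    """
    return [
        array_id
        for array_id, segments in calls_by_array.items()
        if all(end - start <= max_len for _chrom, start, end in segments)
    ]


# Hypothetical normal tissue arrays.
candidate_arrays = {
    "normal_1": [("chr1", 1_000_000, 1_500_000)],   # 0.5 Mb call, kept
    "normal_2": [("chr8", 0, 5_000_000)],           # 5 Mb call, excluded
    "normal_3": [],                                 # no calls, kept
}
reference_set = select_reference_arrays(candidate_arrays)
```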

Quality control. In our review of array data deposited in GEO or collected from publication 
supplements we encountered a large number of individual data sets with insufficient or limited probe 
quality. Also, for samples of unprocessed raw data (e.g. Affymetrix CEL files), we found that QC 
measures reported previously (e.g. call rate [43], NUSE [44], RLE [44]) only had a limited accuracy for 
detection of arrays with inadequate probe level data. Currently, the most viable strategy for quality 
assessment of processed, heterogeneous copy number arrays is the visual inspection of probe plots and 
segmentation results by an experienced researcher. For the first arrayMap edition we generated a 
quality classification system with four categories, based on inspection of genome-wide 
array plots: 

• Excellent. Probe signal distribution differs clearly between balanced and imbalanced 
regions. The signal baseline is distinct and unambiguous, so that segmentation thresholds appear realistic. 
Chromosomal changes are clearly visible. 

• Good. Generally good quality. Probe signal may contain some tolerable noise. Chromosomal 
changes are distinguishable. 

• Hypersegmented. Serrated distribution of probe signal intensities, causing dozens of separate peaks 
and discontinuous segments. Called chromosomal changes typically number up to several hundred and are 
smaller than 5 megabases. 

• Noisy. Probe signal intensities are highly scattered but evenly distributed, with a high standard 
deviation, making it impossible to differentiate copy number changes. 
Depending on the intended research purpose, this basic classification system can be used for a pre-analysis 
triage of copy number data. Applying stringent review criteria, we identified a core dataset of "excellent" 
quality arrays accounting for approximately 60 percent of all arrays. 

We are currently working on a platform-independent quality assessment system for genomic arrays, 
which will be implemented in future versions of the arrayMap resource. 

Associated data. In arrayMap, data is stored as a separate dataset for each array. This is in 
contrast to the Progenetix database, for which technical replicates, where available, are combined into 
case-specific CNA profiles. In arrayMap, technical replicates are assigned an identical case identifier to 
facilitate downstream statistical procedures such as clinical data correlations. The assignment 
of the correct diagnostic entity to each sample is an essential step in linking 
genomic and associated data points. At the same time, to ensure annotation consistency and make the 
retrieval process more efficient, the following data points were manually collected for all CNA profiles 
from GEO/ArrayExpress and published papers, if available: 

• Descriptive diagnostic text, as available through the original source 

• Diagnostic classification according to the International Classification of Diseases for Oncology (ICD-O 
3, morphology with code) 

• Tumor locus according to ICD (ICD topography with code) 

• Source of material (e.g. primary tumor, cell line, metastasis) 

• Clinical parameters where available, including age, gender, grade, clinical stage (TNM coded), 
recurrence/progression, time to recurrence/progression, death and follow-up 
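
To illustrate how a shared case identifier links technical replicates while each array keeps its own record, the following Python sketch parses a few per-array records and groups them by case. The JSON layout, all field names and the identifiers are hypothetical and chosen only for illustration; they are not the actual arrayMap schema.

```python
import json
from collections import defaultdict

# Hypothetical per-array records; field names are assumptions.
records_json = """
[
  {"array_id": "array-001", "case_id": "case-1",
   "diagnosis_text": "glioblastoma",
   "icdo_morphology": "9440/3", "icdo_topography": "C71",
   "material": "primary tumor"},
  {"array_id": "array-002", "case_id": "case-1",
   "diagnosis_text": "glioblastoma",
   "icdo_morphology": "9440/3", "icdo_topography": "C71",
   "material": "primary tumor"},
  {"array_id": "array-003", "case_id": "case-2",
   "diagnosis_text": "glioblastoma",
   "icdo_morphology": "9440/3", "icdo_topography": "C71",
   "material": "cell line"}
]
"""


def group_by_case(records):
    """Map each case identifier to the arrays (replicates) assigned to it."""
    cases = defaultdict(list)
    for record in records:
        cases[record["case_id"]].append(record["array_id"])
    return dict(cases)


records = json.loads(records_json)
arrays_per_case = group_by_case(records)
```

Counting cases rather than arrays in downstream statistics then avoids giving extra weight to samples that were hybridized more than once.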

Web server. An online interface of the arrayMap database was created using Perl Common Gateway 
Interface (CGI) and R scripts running on a Mac OS X Server. Data is stored as flat files in 
JSON format. Precomputed array plots are stored in SVG and PNG versions. The online release of the service 
has been optimized to be compatible with major browsers supporting current web standards (CSS2, 
HTML5, XML with inline SVG; e.g. Safari >= 3.0, Firefox >= 3.0, Internet Explorer >= 9, Chrome), 
with limited fallback support. Dynamic graphics provided in the array plot module were implemented 
using technologies including XML/XHTML, JavaScript, SVG and HTML5 Canvas. 
Supporting information 

Figure S1. Array data sets visualization. Original plots and optimized parameters for GSE21530, 
which contains 8 intimal sarcoma samples hybridized on the Agilent CGH Microarray 244A platform. The 
normalized probe signal log2 ratios and post-thresholding segmentation results for each array are 
displayed. Genomic alterations are represented by horizontal green (gain) and red (loss) lines. 
Alterations are defined here as regions with log2 ratio >0.15 or <-0.15. Simplified schemas of CNAs link to 
the UCSC genome browser for further review. 

Figure S2. Screenshot of single array visualization. arrayMap plots for GSM630977 (acute 
myelogenous leukemia). Besides the whole genome view, subviews of each chromosome are displayed 
as well. These plots clearly reveal different kinds of genomic alterations, e.g. massive 
genomic rearrangement on chromosome 6, arm-level gain of chromosome 8q and a 3 Mb focal change around 
1p31.3. Through the "Plot Array Data" interface, users can segment the raw data values and re-plot the 
results with customized parameters. 

Figure S3. Plot single genomic region. In the "Plot Array Data" interface, the precise 
location (chr5:1100000-1400000) was entered in the "Plot Region" field. Plots of this region were 
generated for all 8 arrays in the current series (GSE21530). The 5 genes in this region are shown 
schematically as colored boxes. CNA status and copy number transition points for these genes are displayed. 

Figure S4. Compound CNA query. (A) Four gene loci associated with glioblastoma (EGFR, PTEN, 
ASPM and CDKN2A) were entered into the "Match regions" field. 303 out of 42421 arrays were returned. 
(B) Classification information for these 303 arrays was displayed and can be selected for subsequent 
analysis. (C) Statistical and plot parameters can be customized. Associated data was processed by online 
tools, and returned results included: (D) chromosomal ideogram and (E) histogram, showing the frequency of 
copy number aberrations; (F) matrix plot revealing the aberration pattern of selected arrays; (G) array 
classification tree generated by hierarchical Ward clustering, in which arrays with similar CNA patterns 
are placed on the same branch; (H) heatmap of CNA frequencies clustered by clinical group. 

Figure S5. Heatmap of frequency profiles for 59 cancer types. Heatmap visualization of 
frequency profiles for all ICD-O entities containing more than 50 arrays in our core dataset. Region specific 
gain/loss frequencies were mapped to 1 Mb intervals. The intensity of colors (green: gains; red: losses) 
corresponds to the relative frequency of CNAs for each interval. 

Table S1: Entities extracted from NCBI GEO and EBI ArrayExpress 

Table S2: Cancer entities grouped by ICD-O code 

Table S3: Platform type distribution in arrayMap 

Table S4: Probe remapping rate for platforms 

Table S5: Normal tissue reference arrays for Affymetrix platforms 



